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IMPROVING THE NUMERICAL STABILITY OF 
FAST MATRIX MULTIPLICATION 


GREY BALLARD*, AUSTIN R. BENSONt, ALEX DRUINSKY*, BENJAMIN LIPSHITZ§, 

AND ODED SCHWARTZ^ 

Abstract. Fast algorithms for matrix multiplication, namely those that perform asymptotically 
fewer scalar operations than the classical algorithm, have been considered primarily of theoretical 
interest. Apart from Strassen’s original algorithm, few fast algorithms have been efficiently im¬ 
plemented or used in practical applications. However, there exist many practical alternatives to 
Strassen’s algorithm with varying performance and numerical properties. Fast algorithms are known 
to be numerically stable, but because their error bounds are slightly weaker than the classical algo¬ 
rithm, they are not used even in cases where they provide a performance benefit. 

We argue in this paper that the numerical sacrifice of fast algorithms, particularly for the typical 
use cases of practical algorithms, is not prohibitive, and we explore ways to improve the accuracy 
both theoretically and empirically. The numerical accuracy of fast matrix multiplication depends 
on properties of the algorithm and of the input matrices, and we consider both contributions inde¬ 
pendently. We generalize and tighten previous error analyses of fast algorithms and compare their 
properties. We discuss algorithmic techniques for improving the error guarantees from two per¬ 
spectives: manipulating the algorithms, and reducing input anomalies by various forms of diagonal 
scaling. Finally, we benchmark performance and demonstrate our improved numerical accuracy. 
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1. Introduction. After Strassen’s discovery of an algorithm for dense matrix- 
matrix multiplication in 1969 [25] that reduced the computational complexity from 
the classical 0(N 3 ) (for multiplying two N x N matrices) to 0(lV log27 ), there has 
been extensive effort to understand fast matrix multiplication, based on algorithms 
with computational complexity exponent less than 3. From a theoretical perspective, 
there remains a gap between the best known lower bound [21] and best known upper 
bound [14] on the exponent. From a practical perspective, it is unlikely that the 
techniques for obtaining the best upper bounds on the exponent can be translated 
to practical algorithms that will execute faster than the classical one for reasonably 
sized matrices. In this paper, we are interested in the numerical stability of practical 
algorithms that have been demonstrated to outperform the classical algorithm (as 
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well as Strassen’s in some instances) on modern hardware [3]. 

Nearly all fast matrix multiplication algorithms are based on recursion, using 
a recursive rule that defines a method for multiplying matrices of fixed dimension 
Mo x Kq by Ko x No MqKqNq scalar multiplications. In this work, we use the 
notation (Mo, Ko, No) for these algorithms. For practical algorithms, these fixed 
dimensions need to be very small, typically Mo, Ko, No < 10, as they define the factors 
by which the dimensions of subproblems are reduced within the recursion. Many such 
algorithms have been recently discovered [3, 24]. Most fast algorithms share a common 
bilinear structure and can be compactly represented by three matrices that we denote 
by [U, V, W], following the notation of Bini and Lotti [4]. Many key properties of the 
practicality of an algorithm, including its numerical stability, can be derived quickly 
from its [U, V, W] representation. We also note that, because recursive subproblems 
are again matrix multiplications, different recursive rules can be combined arbitrarily. 
Following the terminology of Ballard et al. [2] and Demmel et al. [12], we refer to 
algorithms that vary recursive rules across different recursive levels and within each 
level as non-uniform, non-stationary algorithms. If an algorithm uses the same rule 
for every subproblem in each recursive level but varies the rule across levels, we call 
it a uniform, non-stationary algorithm; those defined by only one rule are called 
stationary algorithms. 

Fast matrix multiplication is known to yield larger numerical errors than the clas¬ 
sical algorithm. The forward error guarantee for the classical algorithm is component¬ 
wise: the error bound for each entry in the output matrix depends only on the dot 
product between the corresponding row and column of the input matrices. Fast algo¬ 
rithms perform computations involving other input matrix entries that do not appear 
in a given dot product (their contributions eventually cancel out), and therefore the 
error bounds for these algorithms depend on more global properties of the input ma¬ 
trices. Thus, fast algorithms with no modification are known to exhibit so-called 
norm-wise stability [4] (sometimes referred to as Brent stability [23]) while the clas¬ 
sical algorithm exhibits the stronger component-wise stability, which is unattainable 
for fast algorithms [23]. 

Our main goals in this paper are to explore means for improving the theoretical 
error bounds of fast matrix multiplication algorithms as well as to test the improve¬ 
ments with numerical experiments, focusing particularly on those algorithms that 
yield performance benefits in practice. For computing C = A • B, where A is M x K 
and B is I\ x N, norm-wise stability bounds for full recursion take the following form: 

|[C-C||</ alg (A)||A||||B||e + 0(e 2 ), 

where || ■ || is the max-norm, e is the machine precision, and / a i g is a polynomial 
function that depends on the algorithm [4, 12, 16]. 1 For example, f a i g (K) = K 2 for the 
classical algorithm, with no assumption on the ordering of dot product computations. 
We note that / a i g is independent of the input matrices, and ||A||||B|| is independent 
of the algorithm. In this paper, we explore ways of improving each factor separately. 
Our main contributions include: 

1. generalizing and tightening previous error analysis of stationary fast algo¬ 
rithms to bound / a i g in terms of the number of recursive steps used and two 
principal quantities derived from [U, V, W]; 


1 Here and elsewhere, the 0(e 2 ) term hides dependence on dimensions and norms of input matri¬ 


ces. 
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2. presenting and comparing the stability quantities of recently discovered prac¬ 
tical algorithms; 

3. exploring means of improving algorithmic stability through algorithm selec¬ 
tion and non-uniform, non-stationary combination of algorithms; 

4. presenting diagonal scaling techniques to improve accuracy for inputs with 
entries of widely varying magnitudes; and 

5. showing empirical results of the effects of the various improvement techniques 
on both error and performance. 

The structure of the remainder of the paper is as follows. We describe related 
work in section 2 and introduce our notation for fast matrix multiplication algorithms 
in section 3. Section 4 presents the error analysis for bounding / a i g for general fast 
algorithms, and section 5 discusses the implications of the bounds on known practical 
algorithms. We present diagonal scaling techniques in section 6, showing how to 
reduce the contribution of the input matrices to the error bound. Finally, we discuss 
our results in section 7 and provide directions for improving implementations of fast 
matrix multiplication algorithms. 

2. Related Work. Bini and Lotti [4] provide the first general error bound for 
fast matrix multiplication algorithms, and their analysis provides the basis for our 
results. Demmel et al. [12] generalize Bini and Lotti’s results and show that all fast 
algorithms are stable. A more complete summary of the numerical stability of fast 
algorithms, with a detailed discussion of Strassen’s algorithm along with Winograd’s 
variant, appears in Higliam’s textbook [16, Chapter 23]. We discuss these previous 
works in more detail and compare them to our error bounds in section 4. 

Castrapel and Gustafson [8] and D’Alberto [9] discuss means of improving the nu¬ 
merical stability of Strassen’s algorithm (and Winograd’s variant) using the flexibility 
of non-uniform, non-stationary algorithms. Castrapel and Gustafson propose general 
approaches to such algorithms, and D’Alberto provides a specific improvement in the 
case of two or more levels of recursion. 

Smirnov [24] describes strategies for discovering practical fast algorithms and 
presents several new algorithms, including a rank-23 algorithm for (3,3,3) with the 
fewest known nonzeros and an algorithm for (6, 3,3) that yields a better exponent than 
Strassen’s. Similar techniques are used by Benson and Ballard [3], and they demon¬ 
strate performance improvements over the classical and Strassen’s algorithm for both 
single-threaded and shared-memory multi-threaded implementations. Laderman et 
al. [20] and later Kaporin [18, 19] consider another form of practical algorithms, ones 
that can achieve fewer floating point operations than the Strassen-Winograd vari¬ 
ant for certain matrix dimensions. Kaporin demonstrates better numerical stability 
than Strassen-Winograd and shows comparable performance. However, because the 
base case dimensions proposed are relatively large (e.g., 13 or 20), we suspect that 
the performance will not be competitive on today’s hardware. Further, because the 
[U, V, W] representations are not readily available, we do not consider these types 
of algorithms in this work. 

Dumitrescu [13] proposes a form of diagonal scaling to improve the error bounds 
for Strassen’s algorithm. We refer to his approach as outside scaling and discuss it in 
more detail in section 6. Higham [16] points out that inside scaling can also affect the 
error bound but does not propose a technique for improving it. Demmel et al. [11] 
and Ballard et al. [1] state (without proof) improved error bounds using either inside 
or outside diagonal scaling. 
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3. Fast Matrix Multiplication Algorithms. Fast algorithms for matrix mul¬ 
tiplication are those that perform fewer arithmetic operations than the classical al¬ 
gorithm in an asymptotic sense, achieving a computational complexity exponent less 
than 3 for the square case. We consider such fast algorithms to be practical if they 
have been (or likely can be) demonstrated to outperform the most efficient implemen¬ 
tations of the classical algorithm on current hardware [3]. From a practical viewpoint, 
because matrices arising in current applications have limited size, we can consider a 
fast algorithm’s recursive rule being applied only a few times. In light of this view¬ 
point, we state our algorithms (and error bounds) in terms of the number of recursive 
levels rather than the dimension of the base case, where the number of recursive lev¬ 
els need not be considered a fixed quantity. In the rest of this section, we state the 
notational and terminology of the fast algorithms we consider in this paper. 

3.1. Base Case Algorithms. A bilinear non-commutative algorithm that com¬ 
putes a product of an Mq x Kq matrix and a Kq x Nq matrix (C = AB) using R 
non-scalar (active) multiplications is determined by a MqKq x R matrix U, a KqNq x R 
matrix V, and a MqNq x R matrix W such that 

R M 0 K 0 KqNq 

(1) Cfc = ^^Wfc r TO r , where m r := s r ■ t r , s r := Ui r a,i , t r := Vjrbj , 

r= 1 i— 1 j =1 

for k = 1,..., MqNq. Here, the single indices of entries of A and B assume column- 
major order, the single indices of entries of C assume row-major order, and (•) signifies 
an active multiplication. We will refer to the dimensions of such an algorithm with 
the notation (Mo, Kq, Nq) , the rank of the algorithm by R , and the set of coefficients 
that determine the algorithm with the notation [U, V, W]. 

3.2. Stationary Algorithms. Now we consider multiplying an M x I\ matrix 
A by a A' x A matrix B. We will assume that M, AT, and N are powers of Mo, 
Ao, and No', otherwise, we can always pad the matrices with zeros and the same 
analysis will hold. The fast algorithm proceeds recursively by first partitioning A 
into Mq x A'o submatrices of size (M/Mo) x (AT/A'o) and B into A'o x Nq submatrices 
of size ( K/Kq ) x ( N/Nq ) and then following (1) by matrix blocks, i.e., 

R Mo Kq Kq No 

(2) Cfc = ^aifrM,., where M r := S r ■ T r , S r := 'y^ Uj r Aj, T r := 'y^ Vjr^j 

r= 1 2—1 j =1 

for k = 1,..., MqNq , where (•) signifies a recursive call to the algorithm. Here, we are 
using single subscripts on matrices as an index for the column- or row-major ordering 
of the matrix blocks. The algorithms in this class of fast matrix multiplication are 
called stationary algorithms because they use the same algorithm at each recursive 
step [12]. However, we do not assume that stationary algorithms recurse all the way 
to a base case of dimension 1; we assume only that the base case computation (of 
whatever dimension) is performed using the classical algorithm. Thus, a stationary 
algorithm is defined by the triplet of matrices [U,V,W] along with a number of 
recursive levels L used before switching to the classical algorithm. 

3.3. Uniform, Non-Stationary Algorithms. In contrast to the stationary 
algorithms discussed above, uniform, non-stationary algorithms employ a different 
fast algorithm, in the sense of (1) and (2), at each recursive level [2]. The fast 
algorithm is the same at a given recursive level. Specifically, we will consider uniform, 
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non-stationary algorithms with L steps of recursion, so the algorithm is specified by 
matrices U [ ' ] , V [l] , W w of dimensions x R®, x RM, M^ ] N^ ] x R®, 

for l = 1,..., L. 

Uniform, non-stationary algorithms are of particular interest for maximizing per¬ 
formance. The fastest algorithm for a particular triplet of dimensions M, K , and N 
may depend on many factors; the same algorithm may not be optimal for the recur¬ 
sive subproblems of smaller dimensions. Assuming performance is fixed for a given 
triplet of dimensions, the flexibility of non-stationary algorithms allows for perfor¬ 
mance optimization over a given set of fast algorithms. However, in parallel and more 
heterogeneous settings, better performance may be obtained by the greater generality 
of non-uniform, non-stationary algorithms, described in the next section. 

3.4. Non-Uniform, Non-Stationary Algorithms. The final class of matrix 
multiplication algorithms we consider are non-uniform, non-stationary algorithms. 
In contrast to the previous case, non-uniform, non-stationary algorithms use different 
algorithms within a single recursive level as well across recursive levels [2] , though we 
restrict the dimension of the partition by the fast algorithms at a given recursive level 
to be the same. To define such algorithms, we must specify [U, V, W] for every node 
in the recursion tree, a total of 1 + RM + RWrI 2 ! + • • • + ^ recursive rules. 

We use superscript notation [Z, r*i, 7*2 9 - ,n_i] to denote a recursive node at level l, 

in the top-level subtree rr, second level subtree r 2 , and so on. 

We demonstrate in Subsection 4.5 that the flexibility of these algorithms allows 
for an improvement in the numerical stability of multi-level recursive algorithms. 
We suspect that they also provide a performance benefit over stationary algorithms, 
though this has never been demonstrated empirically. 

4. Error Analysis. The work by Bini and Lotti [4] provides the basic framework 
for the forward error analysis of fast matrix multiplication algorithms. They provide 
general bounds for any square, stationary bilinear algorithm with power-of-two coeffi¬ 
cients (so that there is no error in scalar multiplications), assuming that full recursion 
is used (a base case of dimension 1). Demmel et al. [12] extend the work of Bini and 
Lotti by (1) accounting for errors induced by the scalar multiplications in bilinear 
algorithms, (2) analyzing uniform, non-stationary bilinear fast matrix multiplication 
algorithms (algorithms that use different fast matrix multiplication routines at differ¬ 
ent levels of recursion), and (3) analyzing group-theoretic fast matrix multiplication 
algorithms. The bounds provided by Demmel et al. also assume square algorithms 
and that full recursion is used. Higham [16] provides bounds for Strassen’s original 
algorithm as well as Winograd’s variant in terms of the base case dimension no, where 
the recursion switches to the classical algorithm. Higliam’s bounds are also slightly 
tighter (in the case of Strassen’s and Winograd’s algorithms) than the general bounds 
previously mentioned. We note that any matrix multiplication algorithm satisfying 
the component-wise error bound must perform at least N 3 arithmetic operations; that 
is, we cannot get the same component-wise error bounds even when using just one 
step of recursion of a fast algorithm [23]. 

The goal of the error analysis provided in this section is to generalize the previous 
work in two main directions and to tighten the analysis particularly in the case that 
nonzeros of U, V, and W are not all ±1. First, we will consider rectangular fast 
algorithms; that is, instead of considering recursive rules for multiplying two K x K 
matrices, we consider the more general set of rules for multiplying an M x I\ matrix 
by a K x N matrix. Second, we will state our general bounds in terms of the number 
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of levels of recursion used. Motivated by the results of recently discovered practi¬ 
cal algorithms [3, 24], we would like to understand the theoretical error guarantees 
of an algorithm in terms of its [U, V, W] representation. The recent performance 
results show that rectangular algorithms have practical value (even for multiplying 
square matrices) and that, for performance reasons, typically only a small number of 
recursive steps are used in practice. Several recently discovered practical algorithms 
include fractional power-of-two coefficients (e.g., 1/2, 1/4, 1/8), and we expect that 
other currently undiscovered useful algorithms will include fractional coefficients that 
are not powers of two. Therefore, we make no assumptions on the entries of U, V, and 
W, and we derive principal quantities below that can be tighter than the analogous 
quantities in the previous works by Bini and Lotti [4] and Demmel et al. [12], par¬ 
ticularly in the case of fractional coefficients. This sometimes leads to much sharper 
error bounds (see Example 4). 

Finally, we point out that our representation of non-uniform, non-stationary al¬ 
gorithms is more convenient than previous work. Careful choices of non-uniform, 
non-stationary algorithms have been shown to improve the numerical stability over 
stationary approaches (see Example 6) [9] . Bini and Lotti’s bounds [4] can be applied 
to such algorithms in terms of the global [U,V,W] representation, but the size of 
that representation grows quickly with the number of recursive levels. Our represen¬ 
tation (see subsection 3.4) and error bound (see subsection 4.5), given in terms of the 
base case rule used at each node in the recursion tree, allows for more efficient search 
of combinations of rules and has led to automatic discovery of more stable algorithms 
(see Example 7). 

After defining the principal quantities of interest and specifying our model of 
computation, the rest of this section provides forward error bounds for each of the 
types of fast algorithms defined in section 3. We warn the reader that there are 
notational similarities and (sometimes subtle) inconsistencies with previous work, as 
a result of our tightening of the analysis. 

4.1. Principal quantities. Following the approach of Bini and Lotti [4], we 
identify two principal quantities associated with a fast algorithm that, along with 
the dimensions of the algorithm and the number of levels of recursion, determine its 
theoretical error bounds. The two principal quantities can be easily computed from 
the [U, V, W] representation, and we define them in terms of the following vectors: 

MqKq k 0 n 0 r 

(3) a r := ^2 K U ir^ 0) Pr := K V jr 7 ^ 0 ) 7 k ■= Kw kr / 0 ) 

i— 1 j =1 r =1 


M 0 K 0 K 0 N 0 

( 4 ) ci r := 'y ] \Ui r \ b r := ^ ) Vjr 

*=1 1=1 

for r = 1,..., R and k = 1,..., MqNq, where I is the Boolean-valued indicator func¬ 
tion with value 1 for true and 0 for false. That is, a is the vector of numbers of 
nonzeros in the columns of U, j3 is the vector of numbers of nonzeros in the columns 
of V, 7 is the vector of numbers of nonzeros in the rows of W, a is the vector of 
column 1-norms of U, and b is the vector of column 1-norms of V. When U and V 
have ±1 entries, a = a and /3 = b. 

Definition 1. The prefactor vector q is defined entry-wise by 
Qk = 7fc + max(ay + p r )l(w kr £ 0) 

r 


(5) 
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for k = 1,..., M 0 N 0 , and the prefactor Q is defined as 


Q = ma xq k . 
k 

Definition 2. The stability vector e is defined entry-wise by 

R 

(6) e k = y^a r -b r - \w kr \ 

r—1 


for k = 1,..., MqNq, and the stability factor E is defined as 


E = maxefc. 

k 


The principal quantities for several fast algorithms are listed in Table 1. 

Bini and Lotti [4] provide a definition of q for two different summation algo¬ 
rithms: sequential summation and serialized divide-and-conquer (see subsection 4.2). 
We choose the looser of these two bounds (sequential summation) for generality and 
simpler notation. However, our results are easily converted to the tighter case. Dem- 
mel et abuse the serialized divide-and-conquer algorithm in their analysis. Bini and 
Lotti’s analysis does not account for scalar (non-active) multiplication by elements of 
U, V, and W, so their E parameter depends only on the non-zero structure, rather 
than the magnitude of the elements in these matrices (cf. (4) and Definition 2). Dem- 
mel et al.do account for the multiplication by elements of U, V, and W. However, 
their E parameter is identical to that of Bini and Lotti, and their bound includes an 
additional factor of (\\U\\ ||V|| ||bF||) L , where L is the number of recursive levels and 
|| • || is the max-norm. 

4.2. Model of arithmetic and notation. We follow the notation of Demmel et 
al. [12]. Let 0 = {6 \ |0| < e} be the set of all errors bounded by e (machine precision) 
and let A = {1 + 8 \ 8 £ 0}. We assume the standard model of rounded arithmetic 
where the computed value of op(a, b ) is op(a, b)(l + 8) for some 8 £ 0. We use the set 
operation notation: A + B := {a + b \ a € A, b £ B }, A — B := {a — b \ a £ A, b £ B}, 
and A ■ B := {a ■ b \ a £ A, b £ B}. 

We define = A ■ A ■... ■ A and note that A- 5 C A J+1 as 1 £ A. Furthermore, we 
will not distinguish between singleton sets and an element when using this notation, 
e.g., op(a, 6) (1 + 0) £ op (a, b) A. Finally, we will use the standard hat or fl(-) notation 
to denote a computed value, e.g., C or fl(op(a,b)) £ op(a,b) A. 

Under this arithmetic, the following fact for summation will be useful in our 
analysis 


( 7 ) 


( N \ f N \ 

^2fK c i ' «i) ) € ( ' ai 

i=l / \*=1 / 


A 


N 


where the algorithm for summation is simply to accumulate the terms a,; one at a 
time, in sequential order. By using a serialized divide-and-conquer summation, we 
can also achieve 


( 8 ) 


fl 





^l+[log 2 A] 


For generality, we will assume the more pessimistic bound in (7). Our results can 
easily be modified for the error bounds in (8). 
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Table 1: Principal quantities for a variety of fast matrix multiplication al¬ 
gorithms. The rank of the algorithm (R) drives the asymptotic complexity, 
and the total number of non-zeroes in the U, V, and W (nnz) affects the 
constant in the complexity. Likewise, the E parameter drives the asymptotic 
behavior of the stability bound and the Q parameter affects the constant 
(see Theorem 3). The stability exponent (stab, exp.) denotes the asymptotic 
stability of the algorithm assuming square matrix multiplication (see (12)), 
which allows for comparison of algorithms with different base case sizes. 


(M 0 ,K 0 ,N 0 ) 

ref. 

m 0 k 0 n 0 

R 

nnz 

Q 

E 

stab. exp. 

(2,2,2) 

(classical) 

8 

8 

24 

4 

2 

1 

(2,2,2) 

[25] 

8 

7 

36 

8 

12 

3.58 

(3,2,2) 

[25]* 

12 

11 

48 

8 

12 

3.03 

(2,3,2) 

[25]* 

12 

11 

48 

9 

13 

3.03 

(4,2,2) 

[25]* 

16 

14 

72 

8 

12 

2.94 

(2,4,2) 

[25]* 

16 

14 

72 

12 

24 

2.94 

(3,2,3) 

B 

18 

15 

94 

10 

20 

3.21 

(3,3,2) 

B 

18 

15 

94 

11 

23 

3.21 

(3,3,3) 

[24] 

27 

23 

139 

15 

29 

3.07 

(4,2,3) 

[3] 

24 

20 

130 

14 

34 

3.38 

(3,4,2) 

[3] 

24 

20 

130 

14 

30 

3.38 

(2,3,4) 

[3] 

24 

20 

130 

14 

35 

3.38 

(4,4,2) 

C 

32 

26 

257 

22 

89 

3.90 

(4,2,4) 

C 

32 

26 

257 

23 

92 

3.93 

(3,4,3) 

[3] 

36 

29 

234 

23 

100 

3.66 

(3,3,4) 

[3] 

36 

29 

234 

18 

71 

3.66 

(3,3,6) 

[24] 

54 

40 

960 

39 

428 

4.69 

(3,6,3) 

[24] 

54 

40 

960 

48 

728.5 

4.69 


These algorithms correspond to straightforward generalizations of 
Strassen’s (2, 2, 2) algorithm, either using two copies of the algorithm or 
one copy of the algorithm combined with the classical algorithm. 


We will also use the following property: 

(9) fl ^ CjA°^ £ ^ A Ar+maXj . 

4.3. Forward error analysis of stationary algorithms. The following theo¬ 
rem states the forward error bound for a stationary algorithm in terms of the principal 
quantities Q and E defined in subsection 4.1, which can be readily determined from its 
[U, V, W] representation. The sources of error are floating point error accumulation 
and possible growth in magnitude of intermediate quantities. The floating point error 
accumulation depends in part on Q and grows at worst linearly in L. The growth of 
intermediate quantities depends on E and grows exponentially in L, which typically 
dominates the bound. Table 1 shows the values of these quantities for a variety of 
algorithms. 

Theorem 3. Suppose that C = A ■ B, where A £ and B £ M. KxN is 
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computed by using L recursive steps of the fast matrix multiplication in (2), with the 
classical algorithm used to multiply the (M/AIq) x (K/Kq) matrices by the (K/I\q) x 
(N/Nq) matrices at the base cases of the recursion. Then the computed matrix C 
satisfies 

||C - C|| < (K/Kq + Q-L ) (K/K{f ) • -E' l ||A||||B|| e + 0(e 2 ), 
where || • || is the max-norm. 

Proof. We begin by analyzing how relative errors propagate as we form the S 
and T matrices. Let a superscript index in brackets denote a matrix formed at the 
specified level of recursion. Following (7), we have the following error at the first 
recursive level: 


M 0 K 0 k 0 n 0 

s™ G £ UirAi A Q ", i?!! 1 G VjrB.A^, 
i= 1 1=1 

where a and /3 are defined in (3). 

This error propagates as we recurse. At the 1th level of recursion, the inputs to 
the fast algorithm are given as sums of matrices A^ and B^,, each with a possible 
error of A“ and A b , respectively, for some index sets (f> and %p and some integers a 
and b. Following (2) and (7), the algorithm simply accumulates an additional factor 
of A a o and A^n before the matrices are passed to the subsequent level of recursion. 
Thus, at the Lth level of recursion, we have 

(10) S l r L] G S l r L] A a ^+- + °‘^, f^ ] G tWa^i+'-'+^s 

with r = ri + (r 2 — 1 )R + • • • + (tl — 1 )R L ~ 1 . Note that in exact arithmetic, 

Kf Ng 

(11) Sj. 1 = Ui iri ■ ■ ■ Ui L r L Aj, T[ 1 = y] Vj iri ■ ■ ■ Vj L r L Tij, 

i= 1 1=1 

where i = i 1 + (i 2 - 1 )M 0 K 0 H-f (*l - l)(M 0 A'' 0 ) i_1 and j = ii + ( j 2 - 1)A" 0 A 0 + 

■ ■ ■ + {jl — 1)(A' 0 A 0 ) i_1 represent recursive orderings of the subblocks of A and B. 

We now use the classical algorithm to multiply the computed and ma¬ 
trices at the leaves of the recursion. Because the inner dimension of each leaf-level 
matrix multiplication is K/Kff. from (7) and (10) we accumulate another factor of 
A K / K o to obtain 


G S l r L] T [ r L] A x - +K / K o, 

where \ r = a ri + 0 ri + • • • + a rL + (3 rL for 1 < r < R L . 

Next, the computed matrices are added to form C following(2). At the 1th 
level of recursion, sums of matrices M^, for appropriate index sets <j> and including 
accumulated error A a for some integers a, are added together to form the intermediate 
computed quantities . In the final step at the top of the recursion tree, we have 
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where 7 is as defined in (3). Following (9), if mJ, ^ € MWA lr for some integers x r , 
then 


R 

C k G ^ wt r M| 1] A 7fc+maXr ' x *"Hwkr¥:0) _ 

r=1 

Likewise, a factor of A 7fc < is accumulated at every recursive step, and the con¬ 
tributed error from the matrices comes from the leaf (that is involved in the 

summation) with maximum error. Leaf matrix is involved in the summation 

for C k if Wh lTl ■ ■ ■ Wk L r L 7 ^ 0, where r = n + (r 2 - 1 )R + • • • (tl — l)A i_1 and 
k = ki + (fc 2 — 1 )M 0 N 0 H-+ ( k L — l)(M 0 fVo) i_1 . Thus, we have 

r l 

C k £ ^ VJ kiri ■ ■ ■ W kirL A w+maXr Xr-i(.w kir . 1 -w kL r L ^o)+k/k£ , 

r—1 

where fi k = 7 fci H-b lk L ■ 

Let 8k = Hk + max r \r ■ I(wfe iri • • • Wk L r L 7 0) + K/Kq . In order to determine 
the largest accumulated error, we compute the maximum over all output blocks C^: 


ki,...,k L 


max 8 k = AT/ Kq + max \/i k + max \r • lOu+m ’ ’' w k L r L 7 0) 




= K/Kq + max i 7 fej + max(a ri + P ri )l{w kiri 7 0 ) f + 


+ max \ 7 kL + max(a ri + p rL )l(w kLrL 7 0 ) 


= K/Kq + max 7 kl + max(a n + /3 ri )I(w kiri 7 b 0 ) \ ■ L = K/I\q + Q ■ L 


where Q is given in Definition 1. 

We now compute the forward error bound for each block of the output matrix. 
We have E k = C k — C k £ E r w k x r x • • • iu kirL M.\ L ^Q Sk , which implies (using (11)) 


IE,. 


^E 

r—1 


W kir 


■W kir 


c[L]rp[L] 
vJ r -L , r 


S k £ + 0(e 2 ) 


r l 

M k K k 


< ■ 

'' w kl r L \ Km • 

u iLr L II A* I 'y ] Km ' ■ ■ v jLTL 1 |Bj 

r—1 


3 = 1 

+ 0(e 2 ) 

R l 


KW 


' ' W kl r L \ Km' 

u i L r L 1 ^ ' Km ‘ ' ' v jLTL 1 

r—1 


1=1 

(A7A^)||A||||B||4e + 0(e 2 ) 


Let = J2r \ w k iri ■ ■ ■ w kl r L I J2i \uiin ■ ■ ■ Ui L r L I Ej Km ' ■ ■ Vj hTL \. In order to de¬ 
termine the largest intermediate quantity, we compute the maximum over all output 
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blocks Ck'- 

max = max | w kiri ■ ■ ■ w khrL | ^ \u iiri ■ ■ ■ u iLrL | ^ l^n •'' v jLrL \ 

ri,...,r L h,...,iL jif-rfL 

max E lEl^lE^ml 

ri %i 3 i 

max E ItOfc 

’"L iL 3L 

max E E kir-iiEi^ii 

ri H Ji 

where E is given in Definition 2. 

Computing max*, |E*,| by maximizing over S k and separately, we obtain our 
result. We note that the two quantities may not achieve their maxima for the same k , 
but we ignore the possible looseness as the overall bound will typically be dominated 
by the value of E. □ 

Note that if L = log^ 0 K (full recursion), the bound in Theorem 3 becomes 
||C-C|| < (1 + Q-L) ■ E loSK o K \\A\\\\B\\e + 0{e 2 ) 

which is the bound provided by Demmel et al. [12], assuming Mq = Kq = No, 
M = K = N, all nonzeros of U have the same value, all nonzeros of V have the same 
value, and all nonzeros of W have the same value. If L = 0 (no recursion), we get the 
familiar bound 






|!C-C|| <^ 2 ||A||||B||e + 0(e 2 ). 


Example 4. Because our definition of E (Definition 2) accounts for the magni¬ 
tude of the entries U, V , and W in situ, the bound from Theorem 3 can be tighter 
than previous analyses [4, 12] when U, V, or W has entries outside of {—1,0,1}. 
As an example, we consider a (4,4,2) algorithm, where the U and W matrices have 
entries in {—0.5,0.5} [3] (see C). For this algorithm, E according to Definition 2 is 
89, while E according to previous work is 125. 


4.4. Forward error analysis of uniform, non-stationary algorithms. Re¬ 
call that uniform, non-stationary algorithms use a single algorithm at each recursive 
level. We denote the prefactor vector, stability vector, and partition dimensions of 


and Mq\ Kq\ and NqK Using a 


algorithm U^,V^,W^ at level l by qM, eW, 
similar analysis to subsection 4.3, we get the following stability bound for this class 
of algorithms: 


Theorem 5. Suppose that C = A ■ B is computed by a uniform, non-stationary 
algorithm with L recursive steps of fast matrix multiplication, with the fast algorithm 




used at level l and the classical algorithm used to multiply the mu- 
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trices at the base case of the recursion. Then the computed matrix C satisfies 


1C — CM < 


K 


,nf=i k { 


■E« [ ' 


K 


l—l -“o l -1 


t\L4 ] . 


n^ 1 ||A||||B||e + 0(e 2 ). 


q=i 


Proof. The proof is similar to the proof of Theorem 3. The largest accumulation 
error 5 now satisfies 

max 4 = K + max j 7 [ 1] + max(a[, 1 1 1 + /3W)I(wj. 1] ^ 0)1 + • • • 

k n,-, Kk l fc i i ri 11 j 


+ max | 7 ^ ] + max(a [ r L J + 4? )!(«&. ^ 0)| = 


K 


nf=i^( 


+y: q w , 


0 Z=1 


and the largest intermediate growth quantity f satisfies 
id 1 ] 


max fk = 
k 


max 

fci 


E KJ E 


1 - 1=1 


max 
k 


d [1] K [1] 
J o /v o 


K [1] N [1] 
-^o iV o 

N 

i 


E 

1 a tir 1 

i E 

\v [1] 1 

I jxn i 

... 


21=1 


3 1 =1 

> 

r 


r [ l] 




\ 

L 

£ i diJ 

E 

i«5j 

E i«J 

=n^ ] 


r l —1 


i L — l 


3L= 1 


i=l 


4.5. Forward error analysis of non-uniform, non-stationary algorithms. 

We now consider non-stationary algorithms where the algorithm may be non-uniform 
at every given recursive level of fast matrix multiplication. That is, at any node in the 
recursion tree, we may choose a different fast algorithm. For simplicity, we assume 
that at level l in the recursion tree, all algorithms have the same partitioning scheme 


and rank (so that the 


representations have 


■jj[Z,ri,...,ri_i] -y[Z,n,...,ri_i] ^y[Z,n,...,ri_i] 

the same dimensions across all values r %,..., rj_i) and that after L levels of recursion, 
all leaf nodes use the classical algorithm. 

In the case of stationary algorithms, one [U, V, W] defines the entire algorithm; 


in the case of uniform non-stationary algorithms, L choices of , W T 


define 


the entire algorithm; in this case, we have much more flexibility and can choose 
1 + + • • • + different fast algorithms (the number of internal 

nodes of the recursion tree). Recall that we use the notation [Z,ri,r 2 ,... ,H-i] as a 
superscript to refer to the algorithm used at level l in the recursion tree, where rq 
defines subtree membership at level 1, r 2 defines subtree membership at level 2, and 
so on, and n_i defines the subtree node at the Zth level. 

Our analysis of these algorithms is fundamentally the same—we bound the ac¬ 
cumulated error (<5) and then bound the number of terms (£). However, maximizing 
over all output blocks is not as straightforward and cannot be simplified as cleanly 
as in the previous cases. In particular, we define the largest accumulation error 8 
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recursively as max*, , where 


4 1] = 


K 


nL, '<f 


■iff 


max S 


[2,ri] 




k i71 


^ 0 ), 


c[2,n] [2,n] , <43,7*1, 7 * 2 ] -tt/ [2,t*i] / n \ 

4 = 7fc 2 + max5[ • I ^ 0), 


4' ,; " = 7& Pl, -’ n - 11 + naxf 1 ^ • I(«&. n - l] 4 0) 


r[L,n,- 


■Ti-i] _ [L.ri, 
>k L 

Xr = a™ 


,,ri d +maxxr • ’ ri- ^T^ 0 )’ and 


' ft} + 


Tl 

v [2,n] 


•4 2 / l] 




■44 




This expression does not simplify as before. Note that for block k of the output 
matrix, node (tt, ... ,n_i) at level l of the recursion tree accumulates error for the 
additions/subtractions required by matrix w^ ,ri, ‘"’ T ’ I_d at that node plus the max¬ 
imum accumulated error from any of the combined terms. The expression for \r 
reflects the number of additions and subtractions required to produce the factor ma¬ 
trices S,4 and T,[4 at the leaf nodes, and the error accumulated during the classical 
matrix multiplications is included in the definition of 54 • 

Likewise, the largest intermediate growth quantity £ is maxfc^, where 


j ^kiri W k2r2 

[L,r 

Wj 

kj_, v l 

■,r L -l] 



ri,...,r L 






• E 

[1] [2,71] 

[ L,r ! 
" U i L r L 


• E 

1 2,71 [L,ri,...,rL-i] 

V V V ■ 

Jl7l 3272 jLr L 





ii--.iL 


which we can simplify to 





& = E HE 

«-E 

1 [2,7*!] 

\Wj 
k 2 r 2 

[2,7*!] 7 [2,7*1] ... 

a r 2 u r 2 


7*1 

r2 








Z-^ I W k L r L 

i] [L,ri,...,ri_i] 7 [L,ri,...,r_L_i 

u r L u r L 


r L 


where a and b vectors are defined as in (4). Note that we cannot simplify further as 
in the uniform case. 

In Section 5.2, we use non-uniform, non-stationary algorithms to improve the 
numerical stability of fast matrix multiplication algorithms. 

5. Algorithm selection. Theorem 3 immediately provides several options for 
improving the numerical stability of fast matrix multiplication. First, we can look 
for algorithms with a smaller Q and E. Since prior work on finding fast algorithms 
focuses on performance, this provides a new dimension for algorithm design. And 
in subsection 5.1, we compare several stationary algorithms for the same base case 
as a first step in this dimension of algorithm design. We then extend this analysis 
to non-uniform, non-stationary algorithms in subsection 5.2. Second, we can reduce 
the number of recursive levels before using standard matrix multiplication However, 
fewer recursive levels means an asymptotically slower algorithm. We examine this 
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tradeoff in subsection 5.3. Finally, we can also reduce ||A|| and ||S|| by pre- and 
post-processing the data, and we provide several such strategies in section 6. 

5.1. Searching for better stationary algorithms. Typically, the only quan¬ 
tity of interest for finding fast matrix multiplication algorithms is the rank of the 
solution, which controls the asymptotic complexity. However, we can also search for 
algorithms to minimize the Q and E quantities while maintaining the same rank. This 
will improve the numerical stability of the algorithm without sacrificing (asymptotic) 
performance. We will also consider the number of non-zeros (nnz) in the solution, i.e., 
the sum of the number of non-zero entries in U, V, and W, as this affects the constant 
in the asymptotic complexity and has noticeable impact on empirical performance [3] . 
Thus, the parameters of interest for these algorithms is a performance-stability 3-tuple 
(nnz, Q, E). In general, the number of non-zeros is positively correlated with Q and 
E, since these quantities directly depend on the non-zero patterns of U, V, and W 
(see (5) and (6)). 

We first examined the base case (4,2,3), which has out-performed Strassen’s 
algorithm in practice [3]. We found 479 algorithms with rank R = 20 using nu¬ 
merical low-rank tensor decomposition search techniques [3]. Of these, there were 
208 performance-stability tuples. The smallest nnz, Q , and E quantities over all 
algorithms were 130, 12, and 32, and the corresponding algorithms had performance- 
stability tuples (130, 14, 34), (138, 12, 34), and (134, 13, 32). No algorithm we found 
had parameters that achieved more than one of these minima, so we call these three 
algorithms semi-optimal. Consequently, there is a theoretical trade-off between per¬ 
formance and stability. We note that although this list of algorithms is not exhaustive, 
they are the only publicly available (4, 2,3) algorithms. 2 

We tested the stability of these algorithms by computing the product of samples 
of random matrices A £ ]g>4096x256 anc [ g e ]g>256x2i87. qq ie distributions were a^, 
bij ~ Uniform(0,1) and a^-, bij ~ Uniform(—1,1). In addition to the three semi- 
optimal algorithms described above, we also compared against an algorithm with a 
much worse performance-stability tuple of (156, 26, 132), which we call a sub-optimal 
algorithm. For each pair of matrices, we ran the four algorithms with number of 
recursive levels L = 1, 2,..., 6. Our goal here is to compare the errors of different 
algorithms with the same base case and varying number of recursive levels—we are not 
claiming that any of these algorithms are the best to use for these problem dimensions. 

To estimate ||C — C||, we computed C using the classical algorithm in quadru¬ 
ple precision arithmetic. All other computations used double precision arithmetic. 
Overall, we computed the errors for 100 random pairs A and B for each distribution. 
Figure 1 reports the maximum error over the 100 trials for each algorithm and each 
number of recursive levels as well as the upper bound on the error from Theorem 3. 
We see the following results: 

1. The error bounds are still pessimistic, even with the improved analysis from 
Theorem 3. Furthermore, the error bounds for the three semi-optimal (4, 2,3) 
algorithms are quite similar. 

2. The true error increases with the number of recursive levels, as predicted by 
Theorem 3 and modeled by the error bound. 

3. For both distributions, the sub-optimal algorithm has larger errors than the 
semi-optimal algorithms, as modeled by the error bound. 

4. The difference between the semi-optinral algorithms depends on the matrices. 

2 All of our algorithms, as well as the software for finding them, are publicly available at https: 
//github.com/arbenson/fast- mat mul. 
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Number of recursive levels (L) Number of recursive levels (L) 

Fig. 1: Error for four (4,2,3) fast matrix multiplication algorithms with different 
stability parameters and the classical algorithm as a function of the number of re¬ 
cursive levels, L. Three algorithms are semi-optimal in the sense that they minimize 
one of the following quantities: number of non-zeros, Q , or E. The solid curves are 
the maximum experimental error over 100 pairs of random matrices, and the corre¬ 
sponding markers are the upper bounds from Theorem 3. The experimental error 
increases with L, as modeled by Theorem 3. The semi-optimal algorithms with mini¬ 
mal nnz, Q , and E all have similar performance, but the fast algorithm with a worse 
performance-stability tuple is noticeably less stable in theory and practice. 


For the Uniform(0,1) distribution, there is a clear difference in error between 
the algorithms. Interestingly, the (134, 13, 32) semi-optimal algorithm has 
larger errors than the (130, 14, 34), even though the former algorithm has 
strictly better Q and E parameters. For the Uniform(—1,1) distribution, the 
errors of the semi-optimal algorithms are practically indistinguishable. 

We also considered the (2,3,2) base case, which has optimal rank R = 11 [5]. 
One known algorithm that achieves the optimal rank uses Strassen’s algorithm on a 
2x2 sub-block and classical matrix multiplication on the remaining sub-blocks. The 
base case of the algorithm is small enough so that we could use a SAT solver [10] to 
find over 10,000 rank 11 (2,3,2) algorithms (ignoring symmetries). We found that the 
combination of Strassen’s algorithm with the classical algorithm had a strictly smaller 
performance-stability triple than all of the other rank 11 solutions. We conclude that 
this algorithm is likely optimal in both a performance and a stability sense for the 
class of (2,3,2) algorithms where the scalar multiplications are ±1. 

5.2. Searching for better non-uniform, non-stationary algorithms. Sta¬ 
tionary algorithms benefit from their simplicity, but non-uniform, non-stationary algo¬ 
rithms provide a broader search space for algorithms with better numerical properties. 
We provide several examples below. 

Example 6. D’Alberto [9] describes a non-uniform, non-stationary approach us¬ 
ing Strassen’s algorithm that obtains a smaller stability factor than the original sta¬ 
tionary algorithm (for L > 2). Strassen’s algorithm, with [U,V,W] as given in A, 
has stability vector e = [12 4 4 12]; two levels of recursion with a stationary 

approach yields a two-level stability vector of e (8> e with maximum entry 12 2 = 144. 
D’Alberto shows that, for L = 2, a stability factor of 96 can be obtained with a non- 
uniform approach using one variant of Strassen’s algorithm. One way to achieve this 
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stability factor is to use the alternative algorithm 


_ _ ^ - 


]" ( 

'0 1 


1 o' 

\ ( 

'l o' 


'0 1' 

\ „ r 1 

u,v,w 

— 

H 

1 0 

0 

0 1 

M 

0 1 

0 

1 0 

j w 


for nodes [2,1], [2,3], and [2,4] of the recursion tree, while using the original algorithm 
at nodes [1], [2,2], [2,5], [2,6], and [2,7]. Similar improvements can be made based 
on the Strassen-Winograd algorithm, which has a slightly larger stability factor. 

A more generic non-uniform approach is described in a patent by Castrapel and 
Gustafson [8]. They consider eight variants of the Strassen-Winograd algorithm, de¬ 
fined by 


0 

1 


0 


0 

1 



ll* 

o 


0 

1 



i\ v 

0 


0 

1 



w 


with x,y,z £ {1,2}. The correctness of these variants can be derived from the work of 
Johnson and McLoughlin [17, Equation (6)]. Castrapel and Gustafson suggest using 
random, round-robin, or matrix-dependent selections of algorithms to more evenly 
distribute the error, but they do not prove that any particular techniques will reduce 
the stability factor. 

Example 7. We can improve the two-level stability factor for the (3,2,3) case 
in a similar manner. The smallest stability factor we have discovered for this case is 
E = 20, given by the [U, V, W] in B, which has stability vector 


e = [20 20 2 12 4 20 4 12 20] . 


Compared to a uniform two-level stability factor of 20 2 = 400, we can achieve a 
stability factor of 352 using 3 variants of the algorithm. We use the original algorithm 
at nodes [1], [2,2], [2,6], [2,8], [2,14], and [2,15], the variant 


If 

'1 

0 

o' 

h 0 

0 

0 

1 

Iv 

0 

1 

0 


U, 


1 0 
0 0 
0 1 


0 

1 

0 



1 0 0 
0 0 1 
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'1 

0 

o' 

\ 1 
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0 

! J 


at nodes [2,1], [2,3], [2,10], and [2,11], and the variant 


If 

'0 

1 

o' 

h 0 

0 

0 

1 

liv 

1 

0 

0 


U, 


0 1 
0 0 
1 0 


0 

1 

0 



0 10 
0 0 1 
1 0 0 


'0 

1 

o' 

\ 1 

0 

0 

1 

w 

1 

0 

0 

J J 


at nodes [2,4], [2,5], [2,7], [2,9], [2,12], and [2,13]. We suspect that better two-level 
stability factors are achievable. 

5.3. Performance and stability trade-offs with a small number of recur¬ 
sive levels. In addition to searching for better algorithms, we may also consider the 
effect of the number of recursive levels on the numerical stability. We now consider 
the performance and stability of fast matrix multiplication algorithms across several 
base cases and several values of L. Table 1 summarizes the best known (to us) stabil¬ 
ity factors (E) for several practical base case dimensions. The columns of the table 
represent the relevant performance and stability parameters for each algorithm, all of 
which can be computed from the [U, V, W] representation. 
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The rank R and the number of nonzeros (nnz), along with the number of re¬ 
cursive levels used, determine the number of floating point operations performed by 
the stationary version of the algorithm. The rank can be compared to the product 
AIqKoNo, the rank of the classical algorithm for that base case. The quantities Q 
and E are computed using Definitions 1 and 2, respectively; for a given base case we 
report the algorithm with the best known E along with that algorithm’s Q. We do 
not report both ( M 0 ,K 0 ,N 0 ) and (N 0 , K 0 , M 0 ) because the best algorithms for each 
have identical nnz, E, and Q parameters, due to transformations corresponding to 
transposition of the matrix multiplication. 

Although we stress that these algorithms will be used with only a few levels of 
recursion, we also report the asymptotic stability exponent (stab, exp.) in order 
to compare algorithms across different base case dimensions. If an algorithm for a 
square base case (Nq, N 0 , N 0 ) is used on square matrices of dimension N down to 
subproblems of constant dimension, the bound of Theorem 3 can be simplified to 

(12) ||C-C|| < c - AT log ~o ^ log A7|| A|| ||B||e + 0(e 2 ), 

where c is a constant that depends in part on Q. In this case, the stability exponent 
is log A r o E. We note that the first two rows of Table 1 match the results of Bini 
and Lotti [4, Table 2]. The most stable rank-23 (3,3,3) algorithm of which we are 
aware is a cyclic rotation of the one given by Smirnov [24] . In the case of rectangular 
base cases (Mq. Kq, Nq). we assume a uniform, non-stationary algorithm based on 
cyclic use of algorithms for (M 0 ,K 0 ,N 0 ), (N 0 ,M 0 ,K 0 ), and ( K 0 ,N 0 ,M 0 ), where the 
three recursive rules are transformations of each other, either by cyclic rotations or 
transposition (for more details, see B and C). 

Figure 2 shows the distribution of relative instability and percentage of classical 
flops for the algorithms in Table 1, for L = 1,2,3,4. We measure both terms asymp¬ 
totically. Ignoring the quadratic cost of additions, the percentage of classical flops is 
given by (R/(MqKqNq)) l . For large matrix dimension and L small, we can ignore 
Q by Theorem 3, and we define the relative instability to be ( E/Kq ) l , which is the 
factor by which the error bound exceeds that of the classical algorithm. In general, 
most algorithms follow a narrow log-linear trade-off between these two parameters. 
However, there is still room to select algorithms for a fixed number of recursion levels. 
For example, with L = 1, the (3,3,3) algorithm has roughly the same stability and 
does fewer floating point operations than Strassen’s algorithm. 

6 . Scaling. We now turn our attention to strategies for pre- and post-processing 
matrices in order to improve numerical stability. The error bounds from section 4 can 
be summarized by the following element-wise absolute error bound 

(13) \cij — Cij | < /ai g (AT)|| A|| ||731|e + 0(e 2 ). 

Recall that / a i g is the (at worst) polynomial function of the inner dimension that 
depends on the particular algorithm used. Unfortunately, these bounds can often be 
quite large when |cjj| is small relative to ||A||||H||. The purpose of this section is to 
address the contribution of ||A|| and ||H|| to the error bound, ignoring the particular 
fast algorithm that is used. Thus, for the remainder of this section, we will ignore 
/ai g and consider it a fixed quantity, so that f a i g (K)e = O(e), and we will focus on 
relative error. 

The following example shows that the relative error from fast matrix multiplica¬ 
tion computations can be large. We note that for the purposes of this example and 
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Fig. 2: Distribution of relative instability, ( E/Kq ) l , and percentage of classical flops, 
(R/{M 0 K 0 N 0 )) l , for the algorithms in Table 1 with L = 1,2, 3,4. A larger value on 
the y-axis means a less stable algorithm, and a smaller value on the x-axis means a 
faster algorithm for large problem sizes. There is a general log-linear trade-off between 
stability and number of floating point operations. 


subsequent examples throughout the paper, we assume that floating point operations 
incur an e relative error even if the operands happen to be powers of two or if we are 
subtracting identical values. This assumption does not limit the generality of our ex¬ 
amples; instead, we have chosen the matrix entries to make the examples as simple as 
possible. One could apply small independent relative perturbations on matrix entries 
for the examples to work in standard floating point arithmetic. 

Example 8. Consider the matrices 


(14) 


A = 


1 1 
1 1 


B = 


C = A B = 


2 z 

2 z 


2 

2 ' 


for small z > 0. By (13), we have the following relative error bound 


(15) 


|cn — Cn| 
|cil| 


< O 


f II A|| ||B||e \ 

l |cii| ) 


0 {e/z ), 


which can be quite large for small z. Furthermore, this bound is actually achieved with 
Strassen’s algorithm (see A for the definition of Strassen’s algorithm). Specifically, 
Strassen’s algorithm computes 


m i = (an + a 22 )(&ii + b 22 ) = (1 + 1 ) • (2 + 1 ) 

m 4 = a 22 (b 2 i - 611 ) = 1 • (z - z) 

m 5 = (an + a 12 )b 22 = (1 + 1 ) • 1 

TO 7 = (012 — a 22 )(b 2 i + b 22 ) = (1 — 1 ) ■ (2 + 1 ) 

Cn = mi + 1JI4 — m 5 + my. 
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There are terms of size 0(1) in computing mi, mi, m 5 , and 7717 , so the absolute 
error |cn — cu| is O(e). Since Cn = the relative error is 0(e/z). 

We now demonstrate several methods for improving numerical stability issues by 
pre-processing A and B and post-processing C. The idea underlying these methods 
is the following straightforward observation 

(16) C = DiD^ADD-'BDg'Ds, 


for any nonsingular scaling matrices D 4 , D^, and D. By taking advantage of the 
associativity of matrix multiplication (in exact arithmetic) and scaling matrices D^, 
Db, and D that are easy to apply, we can improve the norm-wise bound in Theorem 3 
without significantly affecting the performance of the algorithm. 

For the algorithms and analysis in this section, we will consider diagonal scaling 
matrices with positive diagonal entries. In order to simplify the analysis, we will as¬ 
sume that there is no numerical error in applying or computing the scaling matrices. 
This could be achieved, for example, by rounding the scaling matrix entries to the 
nearest power of two. Regardless, the error introduced by the fast matrix multipli¬ 
cation algorithm has the larger impact on the stability, and the scaling matrices can 
curb numerical inaccuracies. 

6.1. Outside scaling. In light of (16), Dumitrescu proposed the following out¬ 
side scaling matrices [13]: 


D 4 = diag ^max a,,- 1 ^ , Db = diag ^inax |bij | ^ . 


The resulting procedure is Algorithm 1 . 


Algorithm 1 Outside scaling for fast matrix multiplication 
Require: matrices A and B 
Ensure: C = A • B 

1 : T)a t— diag(max,, ||) 

2: A' i — D^A 

3: D s diag(maxi |6jj|) 

4: B' <- BD)) 1 

5: C' A ■ B with fast matrix multiplication. 

6 : C <- DaC'Db 


Clearly, the algorithm correctly computes C = A B in exact arithmetic, provided 
there are no all-zero rows in A or all-zero columns in B. Importantly, the norm-wise 
bound in Theorem 3 applies to the scaled matrices A! and B 7 . In particular, we get 
the following improved bound [13]. 

Proposition 9. Using Algorithm 1 , 

I Cij ~ Cij\ < 0(e)IK:llll & :dl|. 

Proof. Outside scaling ensures that ||A'|j = ||B , || = 1, so by (13), ||C' — C || < 

O(e). Since C' — C = D^i(C — C)Db, the result follows from the fact that the ith 
diagonal entry of is ||a ij: || and jth diagonal entry of D s is □ 
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For the matrices in Example 8 , the bound from Proposition 9 improves upon (15): 


jcn - Cn| 

kill 


f IMIIIMIA 

1 kill ) 


0(e) 


This indeed improves the numerical stability of Strassen’s algorithm. For the 
matrices in (14), the outside scaling is 


A' <- 


1 

1 


1 

1 


B' 


■<— 



= A' B' = 


2 

2 


And when computing C 7 with Strassen’s algorithm, 


mi = (a 'ii + a! 22) (b'li + b'22) = (1 + 1 ) • (1 + 1 ) 
TO 4 = a' 22 $ 21 — b' a) = 1 • (1 — 1) 
m 5 = (a' 11 + a' 12)^22 = (1 + 1) • 1 
?ri7 = (a' 12 — ei 22){b'21 + b'22) = (1 — 1) • (1 + 1) 
d 11 = mi + TO 4 — 7715 + 7717- 


Now, all sub-terms are on the order of unity, so the relative error in computing 
du is O(e). 


6.2. Inside scaling. There are pairs of matrices where outside scaling is not 
sufficient for numerical stability. 

Example 10. Consider the matrices 


(17) 


A = 



C = A B = 


2z 

2z 


2 z 
2 z 


for small z > 0. Using outside scaling on these matrices does nothing since = 
Dg = I. However, using a fast algorithm can still have severe numerical stability 
issues. Computing C 12 with Strassen’s algorithm uses the following computations: 


m 3 = an(bi 2 - b 22 ) = 1 • (z - 1 ) 
m 5 = (an + ai 2 ) 6 2 2 = (1 + z) ■ 1 
C 12 = m 3 + 777 , 5 . 


The computation of 777.3 and m 3 has terms of unit size, so |ci 2 — £12 is 0(e) and the 
relative error is 0(e/z). This is reflected in the bound from (13): 


IIA || ||B||/|c 12 | = 1/(2z). 

We now propose a technique called inside scaling based on the following matrix: 

<"> —(/si) 

The resulting procedure is in Algorithm 2. The idea is to scale the columns of A and 
the corresponding rows of B to have the same norm. In general, we get an improved 
error bound, as detailed in Proposition 11. We note that there exist several references 
to inside scaling [1, 11, 16, 22], though to our knowledge this is the first explicit 
statement of the diagonal values in (18). A crude inside scaling method was proposed 
earlier by Brent [7], where the inside scaling matrix is D = ^/||B||/||A||I. 
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Algorithm 2 Inside scaling for fast matrix multiplication 
Require: matrices A and B 

Ensure: C = A ■ B 

i: D «— diag(. 

° \ V max i \ a ik \ J 

2 : A' <- AD 
3: B' <- D _1 B 

4: C <— A! ■ B / with fast matrix multiplication. 


Proposition 11. Using Algorithm, 2, 

||C-C|| < 0(e)max|o ifc ||6 fcj -|. 

i,k,j 


Proof. By (13), 

||C-C|| <0(e)||AD||||D- 1 B|| = 0(e) (max||a : , fe ||4fc) (maxd^HMJ ■ 
By the definition of D, 


\\a : ,k\\d k k = d k ^\\b k ,:\\ = fellll^fe,:||, 

so the two maxima are attained at the same index. The result then follows from the 
fact that ||a : , fc |||| 6 fc , : || = max iife j \a ik \\b kj \. □ 

For A and B in Example 10, maxj^j a,;fc| \bkj = z, and we get an O(e) relative 
error bound for computing each entry in C. The inside scaling updates to the matrices 
in (17) are 


V* 

0 1 lyfz_ 





Strassen’s algorithm now computes 


m 3 = a , 11 ( 6 , 12 - b' 22 ) = yfz ■ (y/z - \fz) 
m 5 = (a' n + a[ 2 )b 22 = {yfz + yfz) ■ yfz 
c'i 2 =m 3 + m 5 . 


This time, the computation of m 3 and m 3 involves terms on the order of z instead of 
on the order of unity, and we get 0 (e) relative error in the computation. 

6.3. Repeated outside-inside scaling. Next, we consider repeatedly applying 
outside and inside scaling in alternating order, as shown in Algorithm 3. This process 
can only improve the error bounds, and it eventually converges. Outside and inside 
scaling can simply be applied several times, or the user can specify a cheaply computed 
stopping criterion that will guarantee a relative distance from the limit point. 

We start with our accuracy analysis. In our analysis, we use A*^ and B (,) to 
denote the values of A' and B', respectively, after t steps of Algorithm 3. We also 
use rf 1 and to denote the diagonal elements of and Da, respectively, after t 
steps. The initial values of these variables correspond to t = 0 in our notation. 
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Algorithm 3 Repeated outside-inside scaling for fast matrix multiplication 

1: A' A, B' <- B, B a <- I, D s <- I 

2 : alternate 
3: O step 

4: t— diag(maxfc|a'ifc|) 

5: <— DaD a 

6 : A' <- (D^)- 1 A 7 

7: D' s <- diag(max fc | b' kj |) 

8 : Dfl <— DgDfl 

9: B' <- B'iB's)- 1 

10 : end 

11 : I step 

12. D^diag(yigEj) 

13: A' <- A'D 

14: B' <- D _1 B' 

15: end 

16: until converged 

17: C'<-A' • B 1 with fast matrix multiplication 
18: C <- DaC'Djj 


Proposition 12. Let t be the number of steps of Algorithm 3 that we complete. 
The computed product satisfies 

\ c ij — c»j| < 0(e) r^af ||A^||||B^||. 

Proof. If the last step is an 0 step, then following the proof of Proposition 9, 
||A ( ^|| = ||B^|| = 1 and | 4 — 4 | < O(e). If the last step is an I step, then by 
Proposition 11, 

14 - 41 < 0(e)max|aW||6g| < 0(e)||A<‘>||||B^||. 

The result then follows from the fact that C — C = D^C 7 — C )Db. □ 

We now state the main result of this section. 

Theorem 13. The sequence 




IIA (t) |||!B (t) | 


fort = 0 , 1 ,... 


is monotonically nonincreasing and converges linearly. 

Proof. See D. □ 

This result implies that we can safely iteratively apply inside and outside scaling 
to improve the error bounds, but this process provides diminishing returns. 

Next, we introduce our stopping criterion, which requires the following additional 
notation. Whenever step t is an O step, we use r'^ and sj*^ to denote the diago¬ 
nal elements of the matrices and B’ b , respectively, that we compute in step t. 
Similarly, if step t is an I step, pjg denotes the diagonal elements of D. 

The stopping criterion works as follows. We test the intermediate scaling factors 
p k \ r^ and s'j^ in each iteration starting with the one that immediately follows 
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the first O step. In the / steps, we test whether all of the p'fp fall within the interval 
[(1 + t)“5, (1 + t)s] , and in the 0 steps, we test whether all of the and s'^ 
are greater than the threshold (1 + t) - 1 / 2 . Whenever one of these conditions is true, 
Theorem 14 below states that we are within a relative distance r from the limit, and 
so we stop iterating. In practice, we may just specify t steps of scaling a priori, so 
as to have a better handle on the overhead of the pre-processing. We explore the 
performance overhead of the pre-processing in subsection 6.5. 

We now state the theorem that justifies the stopping criterion. As we show in D, 
the sequences rf\ s^\ ||A (/) || and ||B^|| converge. We use a superscript * to denote 
their limits, so that 


M 


„M 


M 


» 


vWi 


-A HA' 


Mi 


B 


(*) | 


-> IB 


Mi 


and we let 


= 


(*)„(*) ii *(*) i 


|B (t) || - || A w 


B 


Mi 


rW.^IIAWllllBW 


be the relative distance of ||A^ j| ||B ( ^ || from the limit. We use t 0 to denote the 

index of the first O step of the iteration. 

Theorem 14. Let t > 0 be a user-specified tolerance parameter. We have that 
for t = t 0 ,t 0 + 2 ,..., 

max/i|* +1) <t if min > (1 + t) _i and maxp^ +1 ' < (1 + r) 3 , 

i,j J k k 

and 

max< r if minr^ t+2 ' ) > (1 + r )“2 an d mins^ t+2 ^ > (1 + t)~ * . 

i,j J i j 


Proof. See E. □ 

Finally, we note that Algorithm 3 does not specify which form of scaling to apply 
first. While the analysis in this section applies to either choice, we note that the limits 
of the two sequences are not identical -they depend on which step is applied first. 
We conclude this subsection with examples demonstrating that these two choices can 
produce significantly different results (in the case of one iteration of Algorithm 3) and 
that neither choice is always preferable. Example 15 shows a case where performing 
outside followed by inside scaling is more accurate than performing inside followed by 
outside scaling; the opposite is true for the case of Example 16. 

Example 15. Consider the matrices 


A = 


1 

1 




1 

1 ’ 


C = A B = 


1 + z 
2z 


1 + z" 1 
2 


for small z > 0. We consider one step of alternating scaling. Performing outside and 
then inside scaling computes 


(19) 


A' <- 


z 1 

, B' u- 

'1 

1 ' 


'1 + z 

1 + z' 

1 1 

1 

1 

, c 

2 

2 
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and inside followed by outside scaling computes 


2 1 / 2 1 ■ 


2 1 / 2 * 1 / 2 - 


'1 + 2 1 / 4 

1 + 2 1 / 4 ' 

. 1 zl/2 . 

, B' <- 

1 1 

, C' 4- 

_ 2 • 2 1 / 2 

2 • 2 1 / 2 _ 


Consider the computation of entry C 21 with Strassen’s algorithm: 

m 2 = (a' 21 + a '^ b ' u , m 4 = a ' 22 { b ' 21 - &' n ), c 21 = m 2 + m 4 . 

With A' and B' in (19), all sub-terms are 0(1) and c 21 is 0(1), whereas for A' and 
B' in (20), there are 0(2 4 / 4 ) sub-terms and c 21 is 0(z 1 / 2 ). 

From Proposition 12, the absolute error bound for entry C 21 = 0(z ) with no 
scaling is 0 ( 1 / 2 ), with outside and then inside is O(z), and with inside and then 
outside is O^/z 1 / 2 ). Thus, using only one step of Algorithm 3, the accuracy of 
starting with an O step can be much better than that of starting with an / step. 

Example 16. Consider the matrices 


A = 


'1 z 

, B = 

2 1 

II 

PQ 

< 

II 

U 

2 2 

2 

^ _1 

9 

i+2 

z z 


1 2 1 


_2 + 2 2 


for small 2 > 0. We consider one step of alternating scaling. Performing outside and 
then inside scaling computes 


2 + 2 2 1 + Z 

2 z 2 

and inside followed by outside scaling computes 


( 21 ) 


A' 


1 2 

1 1 


B' <- 


C' <- 


( 22 ) 


'1 1' 

, B' ^ 

1 

1 ‘ 


2 

2 

2 1 

1 

1 

J C •<— 

1 + 2 

l + z_ 



Consider the computation of entry C 21 with Strassen’s algorithm: 


m 2 = ( a ' 21 +a 22 )b' n , m 4 = a ' 22 { b ' 21 - b ' n ), c ' 21 = m 2 + m 4 . 


With A' and B' in (21), c 21 is 0[z ) but there are 0(1) subterms, whereas for A! and 
B' in (22), c 21 and all subterms are 0(1). This case illustrates that using only one 
step of Algorithm 3, the accuracy of starting with an I step can be much better than 
that of starting with an O step. 


6.4. Scaling is not always enough. We now provide a simple example that 
shows how Strassen’s algorithm computes a result with large relative error, using any 
of the scaling algorithms presented in this section. 

Example 17. Consider the matrices 


(23) 


A = 


1 2 

2 1 


B = 


1 2 

2 1 


C = A B = 


1 + 2 2 
2 2 


2 2 

1 + 2 2 


for small 2 > 0. In this case, both outside and inside scaling leave the matrix un¬ 
changed. When computing C12, 


m 3 = an(6i2 - b 22 ) = 1(2 - 1) 

m 5 = ( a ll + ai 2 )6 2 2 = (1 + 2 )1 

C12 = m 3 + m 5 , 


There are subterms on the order of unity, so the relative error is 0 ( 1 / 2 ). 
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Number of recursive levels (L) 




Number of recursive levels (L) Number of recursive levels (L) 


Fig. 3: Relative error of Strassen’s algorithm as a function of the number of recursive 
steps, L, for several scaling techniques. The results in each plot are for matrices 
A and B sampled from different probability distributions. (Top) Stability is well- 
behaved, and no scaling is necessary for small relative errors. (Middle) The matrices 
are adversarial and inside-outside or 2-times repeated outside-inside scaling have the 
smallest relative errors. (Right) The matrices are adversarial and outside, outside- 
inside, or 2-times repeated outside-inside scaling have the smallest relative errors. 


6.5. Numerical experiments. We tested the scaling algorithms on samples of 
random matrices whose entries were not as contrived as those in the prior sections. 
We used a sample of A € M. NxN and B !VxJV from the following distributions: 

1. a-ij,bij ~ Uniform(0, 1) 

2. a-ij ~ Uniform(0, 1/N 2 ) if j > A/2, otherwise, ~ Uniform(0,1); bij ~ 
Uniform(0,1/A 2 ) if i < A/2, otherwise bjj ~ Uniform(0,1) 

3. a,ij ~ Uniform(0, A 2 ) if* < A/2 andj > A/2, otherwise, ~ Uniform(0,1); 
bij ~ Uniform(0,1/A 2 ) if j < A/2, otherwise b^ ~ Uniform(0,1) 

Samples from the first distribution are well-behaved for fast matrix multiplication 
algorithms. On the other hand, samples from the second and third distributions are 
adversarial and model the matrices in Examples 10 and 15, respectively. 

We sampled 100 pairs of matrices (A = 2000) from each distribution and com¬ 
puted the error of Strassen’s algorithm with L recursive levels, L = 1,2,... ,6. Specif¬ 
ically, the error was the maximum value of max^ |dij — Cij\/\cij\ over the 100 samples, 
where C was computed with quadruple precision. Figure 3 plots these errors. For 
the first probability distribution, the relative errors are all roughly the same. With 
the second distribution, only inside-outside scaling or 2-times repeated outside-inside 
scaling compute relatively accurate solutions. In this case, inside and outside-inside 
scaling are moderately more accurate than no scaling or outside scaling, but they still 
produce relative errors several orders of magnitude larger than the best case. Finally, 
for the third distribution, inside scaling and no scaling result in much larger relative 
errors, and inside-outside scaling is slightly worse than outside, outside-inside, or 10- 
times repeated outside-inside scaling. These experiments demonstrate that with no 
prior knowledge of the distribution, repeated outside-inside scaling is the safe choice 
for fast matrix multiplication. 

Each iteration of outside or inside scaling is 0(MK + KN + MN) flops, so scaling 
does not affect the asymptotic performance. However, quadratic costs do affect the 
practical implementation of fast matrix multiplication [3]. Subsequently we tested 
the performance impact of scaling. We use effective gflops [3, 22] to measure the 
performance of multiplying an M x K matrix by a K x A matrix: 

2 • MKN - MN 


(24) 


time in seconds 


■ le-9. 
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^ Performance of N x N X N matrix multiplication 


□ Strassen 

—J|f-Strassen + 2x O-l scaling 



20 I-1-■-■-■-■- 

1000 2000 3000 4000 5000 6000 7000 

Dimension (N) 


Fig. 4: Performance of Strassen’s algorithm (L = 1), with and without two steps of 
inside-outside scaling, and the classical algorithm. 


This lets us compare fast matrix multiplication algorithms to the classical algorithm 
on a familiar inverse-time scale. All experiments were conducted on a single compute 
node on NERSC’s Edison machine. Each node has two 12-core Intel 2.4 GHz Ivy 
Bridge processors and 64 GB of memory. Our experiments were single-threaded. We 
report the median of five trials for each timing result. 

Figure 4 summarizes the performance results for Strassen’s algorithm (L = 1), 
with and without two steps of Algorithm 3, for multiplying square matrices of di¬ 
mension N. There is a noticeable impact on performance. Strassen’s without scaling 
out-performs the classical algorithm for N > 2500 while scaling pushes this threshold 
to N > 3500. As N grows, the performance impact of scaling gets smaller. This 
follows from the asymptotic analysis—as N grows, the impact from quadratic terms 
shrinks. 

7. Discussion. One of the central components of our algorithmic error analysis 
is that two data-independent quantities drive the error bounds for fast matrix multi¬ 
plication. First, Q captures the accumulation error from adding matrices. Second, E 
bounds the growth in the magnitude of intermediate terms. Our results in section 5 
show that having a small E is important, but this does not fully characterize stability 
in practice. The same result has been observed when comparing Strassen’s algorithm 
and the Winograd variant [16]. An encouraging result from our experiments is that the 
number of non-zeroes in the U, V, and W matrices, which determines the constant 
in the computational complexity, is positively correlated with E. In other words, for 
a given base case and algorithm rank, by improving performance, we generally also 
improve stability. Another lesson from our analysis is that we should not think of 
using fast algorithms asymptotically but rather as having a fixed number of recur¬ 
sive levels. This leads to better performance in practice [3] and also to the improved 
error bounds and numerical stability presented in sections 4 and 5. Finally, because 
the principal quantities for understanding algorithmic error (E and Q ) are indepen¬ 
dent of the asymptotic complexity, we have new metrics over which to optimize when 
searching for fast matrix multiplication algorithms. 

For performance reasons, the best choice of fast algorithm depends on the shape 
of the matrices being multiplied [3]. In general, a choice of algorithm can be made at 
each recursive level. Subsequently, we believe that uniform non-stationary algorithms 
are the right choice in practice for achieving the best performance. Theorem 5 provides 
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the appropriate error bounds for this case. 

The analysis in subsection 4.5 formalizes the error analysis for existing tech¬ 
niques to improve stability of Strassen’s algorithm and the Winograd variant [8, 9] 
and also generalizes the approach for all fast matrix multiplication algorithms. The 
analysis provides the formula over which to optimize when considering non-uniform, 
non-stationary algorithms. However, finding the best algorithm is a combinatorial 
optimization problem that grows exponentially in the number of recursive levels. Al¬ 
gorithm design in this space is an interesting avenue for future research. 

Using these algorithmic techniques improves the normwise accuracy of the com¬ 
puted product. However, because the errors are normwise, small elements of the 
product can be computed less accurately than warranted by their condition numbers. 
By pre- and post-processing the data, we can improve componentwise accuracy as 
well. Specifically, we analyzed a hierarchy of diagonal scaling techniques that re¬ 
duce the number of cases where fast matrix multiplication yields relatively inaccurate 
small entries in the product. Nevertheless, there are cases that cannot be solved by 
our diagonal scaling algorithms (e.g., Example 17). When scaling helps, a couple of 
iterations are sufficient, and this is backed up by Theorem 13. 

The asymptotic operation cost of diagonal scaling is proportional to the size of the 
matrices, so it is dominated by cost of current matrix multiplication algorithms. In our 
experiments, we found that scaling does incur a noticeable performance penalty for 
reasonably sized matrices, but fast matrix multiplication with diagonal scaling still can 
outperform the classical algorithm. We note that our diagonal scaling implementation 
is not fully optimized; for example, it is possible to overlap inside and outside scaling 
and delay updating actual matrix entries, which can both reduce the memory traffic 
overhead. 
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Appendix A. Strassen’s Algorithm. 

Strassen’s algorithm [25] is a (2, 2,2) algorithm specified by the following U, V, 
and W matrices: 


U = 


V = 


W = 


1 0 

0 0 

0 1 

1 1 

1 1 

0 0 

0 0 

1 0 

1 0 

0 1 

0 0 

1 -1 


1 0 

0 0 

0 0 

0 1 

0 -1 
1 0 

0 1 

-1 0 

0 1 

0 1 

1 0 

1 0 


1 -1 
1 0 

0 1 

0 0 

0 1 

0 1 

0 0 

1 0 

-1 0 

0 0 

1 0 

0 1 


0 

1 

0 

-1 

0 

0 

1 

1 

1 

0 

0 

0 


Note that the rows of U and V correspond to a column-major ordering of the entries 
of the input matrices and the rows of W correspond to a row-major ordering of the 
output matrix, following the convention of previous work [6, 17]. We point out that 
this algorithm is cyclic-invariant, so that [U,V,W] = [W, U,V] = [V,W,U] (up 
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to permutations on the columns of the matrices), which implies that all three rotations 
have the same Q and E values. 


Appendix B. (3, 2, 3) fast matrix multiplication algorithm. 

The following algorithm for base case (3, 2, 3) has 94 nonzeros with E = 20 and 

0 = 10 . 


u = 


V = 


w = 


0 

0 

0 

-1 

-1 

0 

0 

0 

0 

-1 

0 

1 

0 

1 

0 

0 

0 

0 

0 

0 

0 


1 

1 

0 

0 

0 

0 

0 

0 

-1 

0 

1 

0 

0 

0 

0 

0 

0 

1 

0 

0 

0 


0 0-1 
0 0 0 

0 1 0 

1 0 1 

0 0 0 

-10 0 

1 1 0 

1 0 0 

0 0 0 

0 0 0 

1 0 0 

1 0 1 

1 0 0 

0 0 1 

0 0 1 

0-1 0 
0 0 0 

0 0 0 

0 1 0 

0 0 0 

0 -1 1 


1 1 0 

0 0-1 

0 0-1 

0 0 0 

1 0 1 

0 0 1 

0 0 0 

0 0-1 

1 0 0 

0 0 1 

0 1 0 

1 1 0 

0-1 0 

1 0 0 

0 1 0 

0 0 0 

0 0 0 

1 -1 0 

0 0 0 

0 0 1 

0 0 0 


0 0 

0 -1 

0 -1 

0 0 

0 1 

1 0 

0 0 

1 -1 

0 1 

0 1 

0 0 

0 0 

1 0 

0 0 

0 0 

0 -1 

0 0 

0 0 

1 0 

1 -1 

0 0 


-1 0 

0 0 

1 0 

1 0 

0 1 

-1 0 

1 0 

0 0 

0 1 

0 1 

1 0 

0 0 

0 0 

0 -1 
0 0 

0 1 

0 1 

0 0 

0 0 

0 0 

1 0 


0 0 

1 -1 

0 -1 

0 0 

-1 0 

0 0 

0 1 

0 1 

1 -1 

0 -1 

0 0 

0 0 

0 0 

0 0 

0 0 

0 -1 

1 0 

1 0 

0 0 

-1 0 

0 0 


-1 

0 

0 

1 

0 

-1 

1 

0 

0 

0 

1 

1 

-1 

0 

0 

0 

0 

0 

0 

0 

-1 


Note that the rows of U and V correspond to a column-major ordering of the entries 
of the input matrices and the rows of W correspond to a row-major ordering of 
the output matrix, which implies that [W, U,V] is an algorithm for (3,3,2) and 
[V, W, U] is an algorithm for (2, 3, 3). 


Appendix C. (4,4,2) fast matrix multiplication algorithm. 

The following algorithm specifies a rank 26 fast matrix multiplication algorithm 
with base case (4,4,2). 
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V = 


w = 


0 
0 
0 
0 
2 
2 
2 
0 
0 
2 
2 
0 

-2 
0 
0 
0 

0 
1 
0 
0 

1 -2 

1 -2 


0 -1 

0 0 


0 
0 
2 
0 
0 
0 
0 
0 
0 
0 
0 

0 1 

0 0 

0 0 

0 0 

0 0 

-2 0 

2 2 

-2 -2 

2 -2 


1 -1 
0 


0 1 
0 -1 


-2 0 

0 0 

0 0 

0 0 

0 0 

0 0 

0 0 

1 2 

0 0 

0 -2 
0 -2 
1 0 

0 0 


0 -1 

0 0 


-2 -2 

0 -2 

0 0 

2 0 

-2 
-2 


0 0 0 -2 

2 2 2 0 

-2 


-1 -2 

0 0 


0 
0 
0 
0 
0 
0 
2 

-2 
0 
0 
0 
0 
2 
0 
2 

0 0 

-1 0 

0 0 

1 0 

1 -1 
-1 0 


-2 -2 

0 0 


0 -2 

0 0 


0 -1 

0 0 


0 0 
-2 -2 


2 -2 

0 0 


0 
0 

110 0 1 


0 -2 

2 2 

0 -2 

0 -2 

-2 
-2 


-1 -1 

1 0 


0 -1 

1 -1 


0 -2 

0 0 


2 -1 

2 0 


O' 

0 

0 

0 

2 

0 

0 

0 

0 

0 

0 

0 

-2 

0 

0 

0_ 

O' 

0 

0 

0 

2 

0 

-2 

0_ 

r 

-1 

0 

0 

0 

0 

0 

0 


Note that the rows of U and V correspond to a column-major ordering of the 
entries of the input matrices and the rows of W correspond to a row-major ordering 
of the output matrix, which implies that [W,U, V] is an algorithm for (4,2,4) and 
[V, W, U] is an algorithm for (2,4,4). However, [V, W, U] yields an E = 102, which 
is greater than [U, V, W]’s 89. The E value can be maintained for base case (2,4,4) 
by using a different transformation that corresponds to transposing the matrix mul¬ 
tiplication: [ - P 4 j 2 V,'P 4 ! 4 U,'P 2 , 4 'W], where Vm,n is the so-called vec permutation 
matrix [15], exchanging column-ordering for row-ordering for a vectorized m x n ma¬ 
trix. 


Appendix D. Convergence analysis of alternating scaling. 

In this appendix we prove Theorem 13. We start with its first part. 

Lemma 18. The sequence 


s (t) = g(t—i) ii | 


W, 


||aW||||bW|| fort = 0,1,... 

is monotonically nonincreasing. 

Proof. If step t is an O step, then 

||AM|| = ||B (t) ||=l, r| t) =r| t - 1 ) ||ag" 1) ||, 

and therefore 

rf^llA^lllIBWlI = rf- 1) ^- 1) ||a^ : - 1) ||||6j- 1) || 
<ri*" 1 ) sj. t_ 1 ) ||A (t - 1 ) ||||B (t - 1) ||. 

Next, assume that step t is an / step. Column k of A' is transformed so that 




'll^llY (t-u 

bn ’ 
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and therefore 


(25) 

and similarly 

(26) 

Hence, 


i « Sii = 


. ik ( r } i 


a. 


(*—i)i 


= \ a. 




ii^jii = VK^iiii^r 131 


|A (t) || = IIB^II = 


max a / \ a 


,(* — !) II II 


r W = r (t- 1) 


,(*) _ 

b j ~ b j 


and therefore 


4« ) s (t ) || A (/ ) |||j B ( t ) || = r d !) s d ^(max^Ha^ 1} ||||6£* : 1} ||) 


= r?- 1) 4 t - 1) maxri|a (t - 1) ""^- 1) ' 


:,k 


"k 


< rf _1) s^" 1) ||A (t ” 1) ||||B ( ‘" 1) ||. □ 

Next, we prove that the factors in the sequence of Theorem 13 converge individ¬ 
ually. This is required in the subsequent analysis. 

Lemma 19. The sequences rf\ s^\ ||A^|| and ||B^|| for t = 0, 1,... converge. 

Proof. As we show in the proof of Lemma 18, 


||A«|| = ||B (t) || =1, 

|A (t+1) || = ||B( t+1 )|| = max^||ag||||6g| 


for t = t 0 ,t 0 + 2 ,... . 


Therefore 

IIA (t+1) || = max V / ||ag||||6g|| < ^/||AW||||BW|| = 1, 

and hence 

( 27 ) r' (t+2) = |k ( ‘ + 1 ) ||<||A (t+ 1 )||<l. 

The sequence r satisfies 

r ( to ) _ r (*o+i) _ '(to) 
i i i 

^(to+2) _ ^(*0+3) _ r '(t 0 ) r '(t 0 +2) 


It is nonnegative because > 0, it is monotonically nonincreasing by (27), and 
hence it must converge. The same is true for s^\ 
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Consider the effect of the first t steps of the iteration on A 1 and B 7 . The cumu¬ 
lative effect of the 0 steps is to divide the rows of A' by rf 1 and the columns of B' 
by sf\ and that of the I steps is to make sure that every column of A' is equal in 
norm to the corresponding row of B / . Therefore 


(t) 1 

a ik = a ik ~ t) 


max. 


\bkj/' 


S*-l) I 


(t) I | / ( 

r\ \maxi | a ifc /r> 




for t = t 0 + l,to + 2,... 


which shows that the convergence of and s'j 1 guarantees the convergence of off 
and hence also the convergence of ||A ( ^||. The same is true for ||B (, ^||. 

The following lemma shows that the intermediate scaling factors that we compute 
in each step rapidly converge to 1. We use the notation 


S*) 


( 1) 


□ 


w ^ = max^max|logr(^ |, maxjlog sj 1 ' 1 , 


^(t+i) _ max|log p 


r‘i 


for t = ta,ta + 2,... . 


Lemma 20. The following bounds hold: 


w 


(t+1) < 0 ,5w w 


fort = f 0 + 2, t 0 + 4,... . 


Proof. Assume that t = to + 2, to + 4,... . Because step t — 2 is an 0 step, there 
is a column g so that la,-* 2 ^| =1, and therefore 


t(t) I (t-r)| | (i- 2 ) /(t-r)| ^ | (t- 2 ) /(i-l)l r(t-l) 

rV=max| a\ k ; |=max|a) fc ; |>| a\ g >p g ( >\=pJ >. 


Taking logarithms yields 


logr' (t) > logp' (t 1} . 


Both sides of this inequality are nonpositive because r'f^ < 1 by (27), and so 
|logr' (t) | = -logr' W < - log= |logPg (t_1) |. 

A similar analysis shows that 

|logs' (t) | < |logp' / (t ^ 1) | , 

for a suitably defined row /, and these two inequalities imply the first bound in the 
statement of the lemma. 

Next, let us prove the second bound. We have that 

= 


maxj 

m 

maxj 

l6 (t " 1} /s {t) 1 
1% / s j \ 

„ max i| b kj 11 1 max j( 1 / s ? ) ) 

max^l 

a< fk\ 

max^| 

dcvd'i 

- | (t-i) / /(t)i 

max* a) fe ’/r>’\ 
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Inequality (27) states that r'^ < 1, and therefore 

l (t-i) / /(t) | . | (t-i) | 

max \al k / r>' \ > max| a\ k \ , 
which we substitute into the previous inequality, obtaining 

< 

By (25) and (26) 


ma Xj 

l6 (t_1) l 

\°kj 

max, (1/s' (t) ) 

ma Xi 

L (t_1) 

1 a ik 



max a 




ik 


kj 


which implies 


i]Pk +1) ) 2 < maxj(l/s^ (t) ) • 


A similar analysis shows that 




max. 


m 


r (t)\ 


Taking the logarithm of these two bounds and interchanging the positions of the 
logarithms with those of the max operators yields 

- max(log(l/r' (t) )^ < 2 log p k +1 ' > < max(log(l/s' (t) )^ . 


Because r ■ < 1 we have that log(l/r- ) = |logr^ '|, and similarly for s'j. Apply¬ 

ing this to the previous inequality yields 


— maxllog I < 2 log P k t+1 ^ — maxllogs^ 


'(*) I 
j 


and therefore 


2|logp^ t+ y < max (max | log r[^\, max | log | ^ , 


which implies the second bound in the statement of the lemma. □ 

The following lemma proves linear convergence, and therefore completes the proof 
of Theorem 13. 

Lemma 21. There is a sequence so that 

*4? <- ( * ) , M g+ i) <„«+«, 

and 


= v {t) _ 


,(*+ 2 ) 


< 0.5 v 


(*) 


for t = t 0 ,t 0 + 2 ,... . 


write 


Proof. Assume that t = to, to + 2,... . Rearranging the definition of , we may 
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We have that 


r (t) _ r '(to) r '(t 0 +2) 


and therefore 


j,W _ r , ( t o) r / (*o+2) 


'(*) 


„(*) _ '(to) '(to + 2) 

b j ~ b j b j 


„(*) _ e '(to) /(to+2) 

b j ~ b j b j 


At) 

b 3 


|A (t) || = ||B (t) || =1, 


| A (*)ii = || B (*)|i = i. 


Substituting this into the above yields 


(t) 

Aj 


/(to) '(to) /(t 0 +2) /(to+2) 

i b j ' i b j 

,/(t+2) /(t+2) /(t+4) /(t+4) 

i b j i b j 


■A A) 


f(t 0 ) '(to)j(to+2) /(to+2) 


- 1 


v -1 


-1. 


Applying the definition of to this, we obtain 

,,(t) _ ( '(t+2) /(t+2) /(t+4) /(t+4) y-i _ i 
Aj ~ Vi i °j ) 

= exp(— logrf t+2) — logsJ t+2) — logr^ t+4) — logs^ t+4) — • 

= exp(|logr' (t+2) | + |logsf +2) | + |logr' (t+4) | + |logs' (i+4) | 

< exp (2 w ( ' t+2 ' ) + 2w ( ' t+i ' > H-) - 1. 


- 1 


- 1 


We define 


t/dl = exp (2w {t+2) + 2w (t+4) H-) - 1, z/ t+1) = v {t) , 


thereby guaranteeing that ^ as the lemma states. Since ||A^||||B^|| 

is monotonically nonincreasing, so is its relative distance to its limit, meaning that 


and hence 


n (t+1) 

rij 


< M 


(t) 

ij 


+ +1) 

Aj 


< $ < !/W = + +1 > 


This proves another condition in the statement of the lemma, leaving us only with 
the condition A t+2 ' ) < 0.5 i/W to prove. 

By Lemma 20, 

w {t+3) < 0.5 w {t+2) w (t+4) < w (t+3) < 0.5 w (t+2) 

w (t+5) < Q.5w (t+4) < 0.25u; (t+2) i+ +6) < w (t+5) < 0.25z+ +2) 


Therefore 

uM +2) = (0.5 + 0.25 + • • •)w/ (t+2) 

= 0.5u/ ( *+ 2) + 0.25+ i+2) + ■ • • 

> + t+4) + w {t+6) H-, 

and thus 

exp(2^++ > exp(2u/ t+4 ) + 2w (t+6) + •••)• 
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Applying this bound to the definition of we obtain 

z/ 4 ) = exp (2u/ 4+2) + 2u; (4+4) H-) - 1 

= exp(2w (t+2) ) exp(2w (t+4) + 2w (t+6) H-) - 1 

> ^exp (2u; (4+4) + 2 iz; (4+6) H-- 1. 

We define x = exp(2u/ 4+4 ) + 2w^ t+6 ^ + •••), so that the above expression has the form 
x 2 — 1 and z/ t+2 l = x — 1. The rest of the proof is: 

i/W > a: 2 - 1 = (2 + x - l)(x - 1) = (2 + z/ (4+2) ) z/ 4+2) > 2 i/ (4+2) , 


which implies that z/ 4 + 2 ) < 0.5 z/ 4 ) as the lemma states. 

Finally, we show that the analysis in Lemma 21 is asymptotically sharp. 
Lemma 22. There are matrices A and B and indices i and j so that 

T { ij +1) = 1$ , /4j +2) /4? °' 5 f° r 1 = *o> *o + 2,- 


Proof. Let 




0 

1 


□ 


for some integer v, let us start the iteration with an O step, and assume that t = 
to, to + 2,... . A straightforward calculation, which we omit for brevity, shows that 


A (t) 

B (t) 

A (t+i) 

B o+i) 


1 2 

1 O' 
0 1 


0 

_2' u — ^o)/ 2 


l 

1 2 


0 

_2"U-(t —1 0 )/2 — 1 


0 2 


0 

— 2 ' u- ( t-t o)/ 2-1 



r r «] 

1 1 


1 


5 

r 2 () 


1 

5 


-I 0 ' 


1 





2' 

-2 1 

5 

fr (t+1) l 
' 1 

r (*+l) 
J 2 

= 

1 1 

i_i 

J 

■,(*+!)■ 

. 2 

= 



2-2’'(l-2 _(t_t 0 )/2 ) 


and therefore 


'1 o' 

rW — 

1 

o' 


r W 
'1 


1 


r (*)i 

s i 


1 

l i 

5 — 

0 

l 

5 

r (*) 

/ 2 


1 

5 

(*) 
|_ S 2 J 


2-2” 


and 

||A (4) || = ||B (4) || = |!A (4+1) || = ||B (4+1) || = ||A W || = ||B W || = 1. 
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Substituting the above into the definition of fj,f) yields 


(t) _ (t+i) _ IrW^HAWlUlBWlI -rW S W||AW||||BW| 
M 22 — M 22 — 


(28) 


|rW4* } ||AW| 

|2-2”(l-2 _(t_t oR 2 ) _ 2-2" 


B 


(*)i 


_ 2 2”-(*- t o)/2 _ 


Letting x = 2 2 1 0>/ , so that the above expression has the form x 2 — 1 and 

^22 2 ’ = x — 1, we have that 

M 22 = z 2 -1 = (2 + :e- l)(x - 1) = (2 + ^ 22 +2) ) M 22 +2) , 

and therefore 

^2 +2) M2 ) =l/(2 + A*22 +2) ) ■ 

From (28) we conclude that t 0, and therefore ^22 "^ / M 22 ~^ 0.5. □ 

Appendix E. Proof of Theorem 14. 

Proof. In the proof of Lemma 21 (see D), we show that for all i,j and t = 
to, to + 2,..., 


n (t+1) < u {t) 

rij — rij 


fjfjj < exp (2u/*+ 2 ) + 2w (t+4 ) 4 -) - 1, 

exp(2tu (t+4) + 2w (t+6) H-) < exp(2w (t + 2 )) . 


Putting these three statements together yields 

/4‘ +1) < My < exp(2w (t +2) + 2 W 0+ 4 ) + •••)- 1 

= exp (2 W 0+2)) ex p(2«;0+4) + 2w (t+6) + ...)_! 

< exp(2u;0+ 2 )) exp(2w/ <+2 )) — l 

= exp(4«;0+ 2 )) - 1. 


Lemma 20 guarantees that w/*+ 2 ) < uiO+L, and substituting this into the above yields 
(29) My +1 ^ < exp(4w;0 +1 ^ — 1. 

Similarly, 

/iy +2) < exp(2u> (t + 4 ) + 2w (t+6) H-) - 1 < exp(2w ( *+ 2 )) - 1. 

Next, let us prove the first statement of the theorem. Assume that 
(1 + t)~ i < p'^ t+1 ' > < (1+ t) a 


(30) 
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for all k. Taking logarithms yields 

-0.25 log(l + t) < logp , fc (t+1) < 0.25 log(l + r), 

or equivalently 

|logp' fe (t+il | < 0.251og(l + r), 

and therefore 

w (t+i) _ max|logp^ t+1 ^ I < 0.251og(l + r). 
k 

Substituting this into (29), we find that 

/4* +1) < exp(4u; (t+1) ) - 1 < exp(4 • 0.251og(l + r)) - 1 = r, 

for all i,j, which proves the first statement of the theorem. 

Let us prove the second statement of the theorem. Assume that 

(1 + t)- 5 < rf +2) , (1 + r)-5 < sf +2) 

for all i and j. Taking logarithms yields 

—0.51og(l + t) < logr^ t+2 ) , —0.51og(l + r) < logs^ i+2 ^ . 

We show in the proof of Lemma 19 that r^ t+2 ^ < 1, and therefore \ogrf t+2 ^ < 0 and 
similarly for s^ t+2 ^. Hence 

|log r^ t+2) | < 0.51og(l + r), |logs'^ t+2) | < 0.51og(l + r), 

and hence 

w (t+ 2 ) = maxfmax|logr( ( ' t+2 ' > |,max|logs , ^ +2 ^|') < 0.5log(l + r). 

Vi 1 j J / 

Substituting this into (30) yields 

< exp(2rt/ t+2 )) — 1 < exp(2 • 0.5log(l + r)) — 1 = r, 


for all i,j , which proves the second statement of the theorem. 


□ 


